Ran into problems with the code in the previous notebook due to the complicated way of mapping from the list of DIP entry proteins to Uniprot identifiers and then to Entrez identifiers. Simplifying this process here to avoid erroneous entries.
The code from the previous notebook worked fine for the mapping itself, but failed to preserve information about which interactions existed. Making a few small changes to ensure that works.
In [8]:
cd /home/gavin/Documents/MRes/DIP/human/
In [9]:
ls
In [10]:
import csv
c = csv.reader(open("Hsapi20140427.txt"), delimiter="\t")
#skip the header line
c.next()
#keep only the first two fields (the interactor identifier strings) of every line
IDstrings = map(lambda x: (x[0], x[1]), c)
In [11]:
# go through this and keep only the DIP identifier from each interactor
# field, discarding the uniprot and refseq identities packed into the
# same string; the quickest way to write this is a list comprehension
IDstrings = [(x.split("|")[0], y.split("|")[0]) for (x, y) in IDstrings
             if "DIP" in x.split("|")[0] and "DIP" in y.split("|")[0]]
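The flatten function used in the next cell isn't defined anywhere in this notebook; it presumably came in via the pylab environment (matplotlib's cbook module provides one). A minimal sketch of an equivalent generator, in case it needs defining explicitly:
#minimal stand-in for the flatten assumed to come from %pylab /
#matplotlib.cbook in the original environment
def flatten(nested):
    for item in nested:
        if isinstance(item, (list, tuple)):
            for sub in flatten(item):
                yield sub
        else:
            yield item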
In [12]:
from collections import OrderedDict
#flatten and remove duplicate entries, but keep the original structure
flatIDstrings = list(OrderedDict.fromkeys(flatten(IDstrings)))
print "Number of proteins in DIP human dataset is %i"%len(flatIDstrings)
Now there are two variables in use:
IDstrings
- A list of pairs of interacting proteins in the DIP format.
flatIDstrings
- A list of all proteins in the above list, with redundancies removed.
The next step is to convert both of these lists to Uniprot and Entrez identifiers. However, this is not a one-to-one transformation: some identifiers may map to two identifiers, or to none. This has to be taken into account to avoid creating interactions between protein identifiers that don't exist; a small example of the expansion is sketched below.
The mapping can be built using the Uniprot ID mapping service with the flat file as input. The resulting file can then be loaded back in and turned into a Python dictionary:
In [13]:
#writing the flattened protein IDs to file
c = csv.writer(open("flat.DIP.txt", "w"), delimiter="\n")
c.writerow(flatIDstrings)
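Using a newline as the delimiter is a quick way to get one identifier per line, which the Uniprot ID mapping service accepts as batch input.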
The Uniprot ID mapping service returned:
3,509 out of 3,844 identifiers mapped to 3,467 identifiers in the target data set
Reading in this file:
In [14]:
c = csv.reader(open("DIPtouniprot.tab"), delimiter="\t")
# skip first row
c.next()
cl = list(c)
# build dictionary
DIPtouni = {}
for l in cl:
    #place an empty list in for every DIP protein mapped
    DIPtouni[l[0]] = []
for l in cl:
    #fill the lists with uniprot identifiers
    DIPtouni[l[0]].append(l[1])
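As an aside, the same dictionary could be built in a single pass with collections.defaultdict; a sketch:
from collections import defaultdict
DIPtouni = defaultdict(list)
for l in cl:
    DIPtouni[l[0]].append(l[1])
#convert back to a plain dict so missing DIP IDs still raise KeyError
#in the mapping loop below (a defaultdict would silently create them)
DIPtouni = dict(DIPtouni)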
Arbitrary DIP identifiers can now be converted to Uniprot:
In [15]:
print DIPtouni[flatIDstrings[8]]
Converting the entire set of DIP interaction pairs to Uniprot, while keeping track of any entries that fail to map:
In [16]:
import itertools
uniIDstrings = []
faildict = {}
for k, l in enumerate(IDstrings):
    # try to retrieve entries for these proteins
    try:
        unip = map(lambda x: DIPtouni[x], l)
        if len(unip[0]) >= 1 or len(unip[1]) >= 1:
            #iterate over all combinations of the two mapping lists
            for i in itertools.product(unip[0], unip[1]):
                uniIDstrings.append(i)
    except KeyError:
        faildict[k] = l
print "Failed to map %i of %i entries."%(len(faildict.values()), len(IDstrings))
The flattened list of protein identifiers must also be mapped to Uniprot, but these are simply the values contained in the dictionary used to convert from DIP to Uniprot identifiers. Writing these to a file, they can then be used to convert to Entrez identifiers:
In [17]:
c = csv.writer(open("flat.uniprot.txt", "w"), delimiter="\n")
flatuniprot = list(flatten(DIPtouni.values()))
c.writerow(flatuniprot)
Using this on the Uniprot mapping service returns:
3,325 out of 3,511 identifiers mapped to 3,437 identifiers in the target data set
The resulting table can then be loaded in as a dictionary again and used to convert the Uniprot list of interacting pairs:
In [18]:
c = csv.reader(open("uniprottoEntrez.tab"), delimiter="\t")
# skip first row
c.next()
cl = list(c)
# build dictionary
unitoEnt = {}
for l in cl:
    #place an empty list in for every Uniprot protein mapped
    unitoEnt[l[0]] = []
for l in cl:
    #fill the lists with Entrez identifiers
    unitoEnt[l[0]].append(l[1])
In [19]:
import itertools
EntIDstrings = []
faildictEnt = {}
for k, l in enumerate(uniIDstrings):
    # try to retrieve entries for these proteins
    try:
        unip = map(lambda x: unitoEnt[x], l)
        if len(unip[0]) >= 1 or len(unip[1]) >= 1:
            #iterate over all combinations of the two mapping lists
            for i in itertools.product(unip[0], unip[1]):
                EntIDstrings.append(i)
    except KeyError:
        faildictEnt[k] = l
print "Failed to map %i of %i entries."%(len(faildictEnt.values()), len(uniIDstrings))
This produces the list of interacting pairs in Entrez format. Some example entries:
In [20]:
print EntIDstrings[0:10]
There are several files we wish to make from this processed data. The naming scheme for these files (matching the code below) is given here:
flat.DIP.txt
flat.uniprot.txt
flat.Entrez.txt
interacting.DIP.txt
interacting.uniprot.txt
interacting.Entrez.txt
It's also important that entries which failed to map are removed from the previous identifier's list, so that all the lists stay compatible. This can be achieved easily using the faildict variables defined above:
In [21]:
#use indexes in faildict to zero entries we're not interested in
for k in faildict.keys():
    IDstrings[k] = 0
#then use ifilter to remove these entries
IDstrings = list(itertools.ifilter(lambda x: x != 0, IDstrings))
In [22]:
#then do the same for uniIDstrings
for k in faildictEnt.keys():
    uniIDstrings[k] = 0
uniIDstrings = list(itertools.ifilter(lambda x: x != 0, uniIDstrings))
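Equivalently, the zero-and-filter steps in the last two cells could each be written as a single list comprehension over the indices, for example:
#keep only entries whose index is not in the failure dictionary
IDstrings = [x for i, x in enumerate(IDstrings) if i not in faildict]
uniIDstrings = [x for i, x in enumerate(uniIDstrings) if i not in faildictEnt]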
In [23]:
#redefine the flattened list of uniprot IDs
flatuniprot = list(unitoEnt.keys())
#keep only those DIP IDs whose Uniprot mappings all survive into the Entrez mapping
flatIDstrings = [x for x in DIPtouni.keys() if all([a in flatuniprot for a in DIPtouni[x]])]
#regenerate flat files:
csv.writer(open("flat.DIP.txt", "w"), delimiter="\n").writerow(flatIDstrings)
csv.writer(open("flat.uniprot.txt", "w"), delimiter="\n").writerow(flatuniprot)
In [24]:
#then save the flattened Entrez IDs
flatEnt = list(flatten(unitoEnt.values()))
csv.writer(open("flat.Entrez.txt", "w"), delimiter="\n").writerow(flatEnt)
In [25]:
#then save all of the interaction tables
csv.writer(open("interacting.DIP.txt", "w"), delimiter="\t").writerows(IDstrings)
csv.writer(open("interacting.uniprot.txt", "w"), delimiter="\t").writerows(uniIDstrings)
csv.writer(open("interacting.Entrez.txt", "w"), delimiter="\t").writerows(EntIDstrings)
At this point it seems worthwhile to check whether there are bait or prey proteins in our experimental results which are represented in this list of interacting proteins. As this is our ground truth, which will be used to train our prediction algorithm, any interactions present in the DIP dataset must be treated as real. In other words, our prediction algorithm can't outperform the training set.
Loading in the bait and prey lists and combining them:
In [26]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS/
In [27]:
ls
In [28]:
%%bash
head prey_entrez_ids.csv
In [29]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))
In [30]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS/
In [31]:
ls
In [32]:
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))
In [33]:
#check the crossover
crossover = [x for x in baitids+preyids if x in flatEnt]
In [34]:
print "%i of %i Entrez IDs in bait and prey lists are found in DIP dataset"%(len(crossover),len(baitids+preyids))
So about a quarter of the proteins we're interested in are found in the DIP dataset. However, this doesn't mean that a quarter of the interactions we're trying to find are already solved: there could be interactions between these crossover proteins that are not accounted for in the DIP dataset. A quick check of this is sketched below.
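To put a number on that, the DIP interaction list can be intersected with the crossover proteins directly. A sketch (the experimentids set is introduced here for illustration):
#count DIP interactions where both partners appear in the bait+prey lists
experimentids = set(baitids + preyids)
known = [(a, b) for (a, b) in EntIDstrings
         if a in experimentids and b in experimentids]
print "%i DIP interactions are between proteins in our experiment" % len(known)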
Colin asked what the crossover is for the bait list alone, and it's pretty easy to calculate:
In [35]:
baitcrossover = [x for x in baitids if x in flatEnt]
print "%i of %i Entrez IDs in bait list are found in DIP dataset"%(len(baitcrossover),len(baitids))